Pesquisa | Portal Regional da BVS

1.

MedCPT: Contrastive Pre-trained Transformers with large-scale PubMed search logs for zero-shot biomedical information retrieval.

Jin, Qiao; Kim, Won; Chen, Qingyu; Comeau, Donald C; Yeganova, Lana; Wilbur, W John; Lu, Zhiyong.

Bioinformatics ; 39(11)2023 11 01.

Artigo em Inglês | MEDLINE | ID: mdl-37930897

RESUMO

MOTIVATION: Information retrieval (IR) is essential in biomedical knowledge acquisition and clinical decision support. While recent progress has shown that language model encoders perform better semantic retrieval, training such models requires abundant query-article annotations that are difficult to obtain in biomedicine. As a result, most biomedical IR systems only conduct lexical matching. In response, we introduce MedCPT, a first-of-its-kind Contrastively Pre-trained Transformer model for zero-shot semantic IR in biomedicine. RESULTS: To train MedCPT, we collected an unprecedented scale of 255 million user click logs from PubMed. With such data, we use contrastive learning to train a pair of closely integrated retriever and re-ranker. Experimental results show that MedCPT sets new state-of-the-art performance on six biomedical IR tasks, outperforming various baselines including much larger models, such as GPT-3-sized cpt-text-XL. In addition, MedCPT also generates better biomedical article and sentence representations for semantic evaluations. As such, MedCPT can be readily applied to various real-world biomedical IR tasks. AVAILABILITY AND IMPLEMENTATION: The MedCPT code and model are available at https://github.com/ncbi/MedCPT.

Assuntos

Armazenamento e Recuperação da Informação , Semântica , Idioma , Processamento de Linguagem Natural , PubMed , Literatura de Revisão como Assunto

2.

Comprehensively identifying Long Covid articles with human-in-the-loop machine learning.

Leaman, Robert; Islamaj, Rezarta; Allot, Alexis; Chen, Qingyu; Wilbur, W John; Lu, Zhiyong.

Patterns (N Y) ; 4(1): 100659, 2023 Jan 13.

Artigo em Inglês | MEDLINE | ID: mdl-36471749

RESUMO

A significant percentage of COVID-19 survivors experience ongoing multisystemic symptoms that often affect daily living, a condition known as Long Covid or post-acute-sequelae of SARS-CoV-2 infection. However, identifying scientific articles relevant to Long Covid is challenging since there is no standardized or consensus terminology. We developed an iterative human-in-the-loop machine learning framework combining data programming with active learning into a robust ensemble model, demonstrating higher specificity and considerably higher sensitivity than other methods. Analysis of the Long Covid Collection shows that (1) most Long Covid articles do not refer to Long Covid by any name, (2) when the condition is named, the name used most frequently in the literature is Long Covid, and (3) Long Covid is associated with disorders in a wide variety of body systems. The Long Covid Collection is updated weekly and is searchable online at the LitCovid portal: https://www.ncbi.nlm.nih.gov/research/coronavirus/docsum?filters=e_condition.LongCovid.

3.

Towards a unified search: Improving PubMed retrieval with full text.

Kim, Won; Yeganova, Lana; Comeau, Donald C; Wilbur, W John; Lu, Zhiyong.

J Biomed Inform ; 134: 104211, 2022 10.

Artigo em Inglês | MEDLINE | ID: mdl-36152950

RESUMO

OBJECTIVE: A significant number of recent articles in PubMed have full text available in PubMed Central®, and the availability of full texts has been consistently growing. However, it is not currently possible for a user to simultaneously query the contents of both databases and receive a single integrated search result. In this study, we investigate how to score full text articles given a multitoken query and how to combine those full text article scores with scores originating from abstracts and achieve an overall improved retrieval performance. MATERIALS AND METHODS: For scoring full text articles, we propose a method to combine information coming from different sections by converting the traditionally used BM25 scores into log odds ratio scores which can be treated uniformly. We further propose a method that successfully combines scores from two heterogenous retrieval sources - full text articles and abstract only articles - by balancing the contributions of their respective scores through a probabilistic transformation. We use PubMed click data that consists of queries sampled from PubMed user logs along with a subset of retrieved and clicked documents to train the probabilistic functions and to evaluate retrieval effectiveness. RESULTS AND CONCLUSIONS: Random ranking achieves 0.579 MAP score on our PubMed click data. BM25 ranking on PubMed abstracts improves the MAP by 10.6%. For full text documents, experiments confirm that BM25 section scores are of different value depending on the section type and are not directly comparable. Naïvely using the body text of articles along with abstract text degrades the overall quality of the search. The proposed log odds ratio scores normalize and combine the contributions of occurrences of query tokens in different sections. By including full text where available, we gain another 0.67%, or 7% relative improvement over abstract alone. We find an advantage in the more accurate estimate of the value of BM25 scores depending on the section from which they were produced. Taking the sum of top three section scores performs the best.

Assuntos

Gerenciamento de Dados , Armazenamento e Recuperação da Informação , PubMed

4.

Evolving use of ancestry, ethnicity, and race in genetics research-A survey spanning seven decades.

Byeon, Yen Ji Julia; Islamaj, Rezarta; Yeganova, Lana; Wilbur, W John; Lu, Zhiyong; Brody, Lawrence C; Bonham, Vence L.

Am J Hum Genet ; 108(12): 2215-2223, 2021 12 02.

Artigo em Inglês | MEDLINE | ID: mdl-34861173

RESUMO

To inform continuous and rigorous reflection about the description of human populations in genomics research, this study investigates the historical and contemporary use of the terms "ancestry," "ethnicity," "race," and other population labels in The American Journal of Human Genetics from 1949 to 2018. We characterize these terms' frequency of use and assess their odds of co-occurrence with a set of social and genetic topical terms. Throughout The Journal's 70-year history, "ancestry" and "ethnicity" have increased in use, appearing in 33% and 26% of articles in 2009-2018, while the use of "race" has decreased, occurring in 4% of articles in 2009-2018. Although its overall use has declined, the odds of "race" appearing in the presence of "ethnicity" has increased relative to the odds of occurring in its absence. Forms of population descriptors "Caucasian" and "Negro" have largely disappeared from The Journal (<1% of articles in 2009-2018). Conversely, the continental labels "African," "Asian," and "European" have increased in use and appear in 18%, 14%, and 42% of articles from 2009-2018, respectively. Decreasing uses of the terms "race," "Caucasian," and "Negro" are indicative of a transition away from the field's history of explicitly biological race science; at the same time, the increasing use of "ancestry," "ethnicity," and continental labels should serve to motivate ongoing reflection as the terminology used to describe genetic variation continues to evolve.

Assuntos

Pesquisa em Genética , Genética Humana/tendências , Etnicidade , Pesquisa em Genética/história , História do Século XX , História do Século XXI , Genética Humana/história , Humanos , Editoração/história , Grupos Raciais

5.

Better synonyms for enriching biomedical search.

Yeganova, Lana; Kim, Sun; Chen, Qingyu; Balasanov, Grigory; Wilbur, W John; Lu, Zhiyong.

J Am Med Inform Assoc ; 27(12): 1894-1902, 2020 12 09.

Artigo em Inglês | MEDLINE | ID: mdl-33083825

RESUMO

OBJECTIVE: In a biomedical literature search, the link between a query and a document is often not established, because they use different terms to refer to the same concept. Distributional word embeddings are frequently used for detecting related words by computing the cosine similarity between them. However, previous research has not established either the best embedding methods for detecting synonyms among related word pairs or how effective such methods may be. MATERIALS AND METHODS: In this study, we first create the BioSearchSyn set, a manually annotated set of synonyms, to assess and compare 3 widely used word-embedding methods (word2vec, fastText, and GloVe) in their ability to detect synonyms among related pairs of words. We demonstrate the shortcomings of the cosine similarity score between word embeddings for this task: the same scores have very different meanings for the different methods. To address the problem, we propose utilizing pool adjacent violators (PAV), an isotonic regression algorithm, to transform a cosine similarity into a probability of 2 words being synonyms. RESULTS: Experimental results using the BioSearchSyn set as a gold standard reveal which embedding methods have the best performance in identifying synonym pairs. The BioSearchSyn set also allows converting cosine similarity scores into probabilities, which provides a uniform interpretation of the synonymy score over different methods. CONCLUSIONS: We introduced the BioSearchSyn corpus of 1000 term pairs, which allowed us to identify the best embedding method for detecting synonymy for biomedical search. Using the proposed method, we created PubTermVariants2.0: a large, automatically extracted set of synonym pairs that have augmented PubMed searches since the spring of 2019.

Assuntos

Pesquisa Biomédica , Armazenamento e Recuperação da Informação/métodos , Linguística , PubMed , Algoritmos , Probabilidade , Terminologia como Assunto

6.

PDC - a probabilistic distributional clustering algorithm: a case study on suicide articles in PubMed.

Islamaj, Rezarta; Yeganova, Lana; Kim, Won; Xie, Natalie; Wilbur, W John; Lu, Zhiyong.

AMIA Jt Summits Transl Sci Proc ; 2020: 259-268, 2020.

Artigo em Inglês | MEDLINE | ID: mdl-32477645

RESUMO

The need to organize a large collection in a manner that facilitates human comprehension is crucial given the ever-increasing volumes of information. In this work, we present PDC (probabilistic distributional clustering), a novel algorithm that, given a document collection, computes disjoint term sets representing topics in the collection. The algorithm relies on probabilities of word co-occurrences to partition the set of terms appearing in the collection of documents into disjoint groups of related terms. In this work, we also present an environment to visualize the computed topics in the term space and retrieve the most related PubMed articles for each group of terms. We illustrate the algorithm by applying it to PubMed documents on the topic of suicide. Suicide is a major public health problem identified as the tenth leading cause of death in the US. In this application, our goal is to provide a global view of the mental health literature pertaining to the subject of suicide, and through this, to help create a rich environment of multifaceted data to guide health care researchers in their endeavor to better understand the breadth, depth and scope of the problem. We demonstrate the usefulness of the proposed algorithm by providing a web portal that allows mental health researchers to peruse the suicide-related literature in PubMed.

7.

Deep learning with sentence embeddings pre-trained on biomedical corpora improves the performance of finding similar sentences in electronic medical records.

Chen, Qingyu; Du, Jingcheng; Kim, Sun; Wilbur, W John; Lu, Zhiyong.

BMC Med Inform Decis Mak ; 20(Suppl 1): 73, 2020 04 30.

Artigo em Inglês | MEDLINE | ID: mdl-32349758

RESUMO

BACKGROUND: Capturing sentence semantics plays a vital role in a range of text mining applications. Despite continuous efforts on the development of related datasets and models in the general domain, both datasets and models are limited in biomedical and clinical domains. The BioCreative/OHNLP2018 organizers have made the first attempt to annotate 1068 sentence pairs from clinical notes and have called for a community effort to tackle the Semantic Textual Similarity (BioCreative/OHNLP STS) challenge. METHODS: We developed models using traditional machine learning and deep learning approaches. For the post challenge, we focused on two models: the Random Forest and the Encoder Network. We applied sentence embeddings pre-trained on PubMed abstracts and MIMIC-III clinical notes and updated the Random Forest and the Encoder Network accordingly. RESULTS: The official results demonstrated our best submission was the ensemble of eight models. It achieved a Person correlation coefficient of 0.8328 - the highest performance among 13 submissions from 4 teams. For the post challenge, the performance of both Random Forest and the Encoder Network was improved; in particular, the correlation of the Encoder Network was improved by ~ 13%. During the challenge task, no end-to-end deep learning models had better performance than machine learning models that take manually-crafted features. In contrast, with the sentence embeddings pre-trained on biomedical corpora, the Encoder Network now achieves a correlation of ~ 0.84, which is higher than the original best model. The ensembled model taking the improved versions of the Random Forest and Encoder Network as inputs further increased performance to 0.8528. CONCLUSIONS: Deep learning models with sentence embeddings pre-trained on biomedical corpora achieve the highest performance on the test set. Through error analysis, we find that end-to-end deep learning models and traditional machine learning models with manually-crafted features complement each other by finding different types of sentences. We suggest a combination of these models can better find similar sentences in practice.

Assuntos

Aprendizado Profundo , Registros Eletrônicos de Saúde , Armazenamento e Recuperação da Informação/métodos , Aprendizado de Máquina , Mineração de Dados , Humanos , Idioma , PubMed

8.

PubMed Text Similarity Model and its application to curation efforts in the Conserved Domain Database.

Islamaj, Rezarta; Wilbur, W John; Xie, Natalie; Gonzales, Noreen R; Thanki, Narmada; Yamashita, Roxanne; Zheng, Chanjuan; Marchler-Bauer, Aron; Lu, Zhiyong.

Database (Oxford) ; 20192019 01 01.

Artigo em Inglês | MEDLINE | ID: mdl-31267135

RESUMO

This study proposes a text similarity model to help biocuration efforts of the Conserved Domain Database (CDD). CDD is a curated resource that catalogs annotated multiple sequence alignment models for ancient domains and full-length proteins. These models allow for fast searching and quick identification of conserved motifs in protein sequences via Reverse PSI-BLAST. In addition, CDD curators prepare summaries detailing the function of these conserved domains and specific protein families, based on published peer-reviewed articles. To facilitate information access for database users, it is desirable to specifically identify the referenced articles that support the assertions of curator-composed sentences. Moreover, CDD curators desire an alert system that scans the newly published literature and proposes related articles of relevance to the existing CDD records. Our approach to address these needs is a text similarity method that automatically maps a curator-written statement to candidate sentences extracted from the list of referenced articles, as well as the articles in the PubMed Central database. To evaluate this proposal, we paired CDD description sentences with the top 10 matching sentences from the literature, which were given to curators for review. Through this exercise, we discovered that we were able to map the articles in the reference list to the CDD description statements with an accuracy of 77%. In the dataset that was reviewed by curators, we were able to successfully provide references for 86% of the curator statements. In addition, we suggested new articles for curator review, which were accepted by curators to be added into the reference list at an acceptance rate of 50%. Through this process, we developed a substantial corpus of similar sentences from biomedical articles on protein sequence, structure and function research, which constitute the CDD text similarity corpus. This corpus contains 5159 sentence pairs judged for their similarity on a scale from 1 (low) to 5 (high) doubly annotated by four CDD curators. Curator-assigned similarity scores have a Pearson correlation coefficient of 0.70 and an inter-annotator agreement of 85%. To date, this is the largest biomedical text similarity resource that has been manually judged, evaluated and made publicly available to the community to foster research and development of text similarity algorithms.

Assuntos

Algoritmos , Curadoria de Dados , Bases de Dados de Proteínas , Proteínas , PubMed , Alinhamento de Sequência , Domínios Proteicos , Proteínas/química , Proteínas/genética

9.

LitSense: making sense of biomedical literature at sentence level.

Allot, Alexis; Chen, Qingyu; Kim, Sun; Vera Alvarez, Roberto; Comeau, Donald C; Wilbur, W John; Lu, Zhiyong.

Nucleic Acids Res ; 47(W1): W594-W599, 2019 07 02.

Artigo em Inglês | MEDLINE | ID: mdl-31020319

RESUMO

Literature search is a routine practice for scientific studies as new discoveries build on knowledge from the past. Current tools (e.g. PubMed, PubMed Central), however, generally require significant effort in query formulation and optimization (especially in searching the full-length articles) and do not allow direct retrieval of specific statements, which is key for tasks such as comparing/validating new findings with previous knowledge and performing evidence attribution in biocuration. Thus, we introduce LitSense, which is the first web-based system that specializes in sentence retrieval for biomedical literature. LitSense provides unified access to PubMed and PMC content with over a half-billion sentences in total. Given a query, LitSense returns best-matching sentences using both a traditional term-weighting approach that up-weights sentences that contain more of the rare terms in the user query as well as a novel neural embedding approach that enables the retrieval of semantically relevant results without explicit keyword match. LitSense provides a user-friendly interface that assists its users to quickly browse the returned sentences in context and/or further filter search results by section or publication date. LitSense also employs PubTator to highlight biomedical entities (e.g. gene/proteins) in the sentences for better result visualization. LitSense is freely available at https://www.ncbi.nlm.nih.gov/research/litsense.

Assuntos

Mineração de Dados/métodos , Software , Indexação e Redação de Resumos , PubMed , Publicações

10.

A Field Sensor: computing the composition and intent of PubMed queries.

Yeganova, Lana; Kim, Won; Comeau, Donald C; Wilbur, W John; Lu, Zhiyong.

Database (Oxford) ; 20182018 01 01.

Artigo em Inglês | MEDLINE | ID: mdl-30010750

RESUMO

PubMed® is a search engine providing access to a collection of over 27 million biomedical bibliographic records as of 2017. PubMed processes millions of queries a day, and understanding these queries is one of the main building blocks for successful information retrieval. In this work, we present Field Sensor, a domain-specific tool for understanding the composition and predicting the user intent of PubMed queries. Given a query, the Field Sensor infers a field for each token or sequence of tokens in a query in multi-step process that includes syntactic chunking, rule-based tagging and probabilistic field prediction. In this work, the fields of interest are those associated with (meta-)data elements of each PubMed record such as article title, abstract, author name(s), journal title, volume, issue, page and date. We evaluate the accuracy of our algorithm on a human-annotated corpus of 10 000 PubMed queries, as well as a new machine-annotated set of 103 000 PubMed queries. The Field Sensor achieves an accuracy of 93 and 91% on the two corresponding corpora and finds that nearly half of all searches are navigational (e.g. author searches, article title searches etc.) and half are informational (e.g. topical searches). The Field Sensor has been integrated into PubMed since June 2017 to detect informational queries for which results sorted by relevance can be suggested as an alternative to those sorted by the default date sort. In addition, the composition of PubMed queries as computed by the Field Sensor proves to be essential for understanding how users query PubMed.

Assuntos

Algoritmos , PubMed , Ferramenta de Busca , Curadoria de Dados , Publicações , Padrões de Referência

11.

Discovering themes in biomedical literature using a projection-based algorithm.

Yeganova, Lana; Kim, Sun; Balasanov, Grigory; Wilbur, W John.

BMC Bioinformatics ; 19(1): 269, 2018 07 16.

Artigo em Inglês | MEDLINE | ID: mdl-30012087

RESUMO

BACKGROUND: The need to organize any large document collection in a manner that facilitates human comprehension has become crucial with the increasing volume of information available. Two common approaches to provide a broad overview of the information space are document clustering and topic modeling. Clustering aims to group documents or terms into meaningful clusters. Topic modeling, on the other hand, focuses on finding coherent keywords for describing topics appearing in a set of documents. In addition, there have been efforts for clustering documents and finding keywords simultaneously. RESULTS: We present an algorithm to analyze document collections that is based on a notion of a theme, defined as a dual representation based on a set of documents and key terms. In this work, a novel vector space mechanism is proposed for computing themes. Starting with a single document, the theme algorithm treats terms and documents as explicit components, and iteratively uses each representation to refine the other until the theme is detected. The method heavily relies on an optimization routine that we refer to as the projection algorithm which, under specific conditions, is guaranteed to converge to the first singular vector of a data matrix. We apply our algorithm to a collection of about sixty thousand PubMed â documents examining the subject of Single Nucleotide Polymorphism, evaluate the results and show the effectiveness and scalability of the proposed method. CONCLUSIONS: This study presents a contribution on theoretical and algorithmic levels, as well as demonstrates the feasibility of the method for large scale applications. The evaluation of our system on benchmark datasets demonstrates that our method compares favorably with the current state-of-the-art methods in computing clusters of documents with coherent topic terms.

Assuntos

Algoritmos , Publicações , Análise por Conglomerados , Bases de Dados Genéticas , Humanos , Polimorfismo de Nucleotídeo Único/genética

12.

PubMed Phrases, an open set of coherent phrases for searching biomedical literature.

Kim, Sun; Yeganova, Lana; Comeau, Donald C; Wilbur, W John; Lu, Zhiyong.

Sci Data ; 5: 180104, 2018 06 12.

Artigo em Inglês | MEDLINE | ID: mdl-29893755

RESUMO

In biomedicine, key concepts are often expressed by multiple words (e.g., 'zinc finger protein'). Previous work has shown treating a sequence of words as a meaningful unit, where applicable, is not only important for human understanding but also beneficial for automatic information seeking. Here we present a collection of PubMed® Phrases that are beneficial for information retrieval and human comprehension. We define these phrases as coherent chunks that are logically connected. To collect the phrase set, we apply the hypergeometric test to detect segments of consecutive terms that are likely to appear together in PubMed. These text segments are then filtered using the BM25 ranking function to ensure that they are beneficial from an information retrieval perspective. Thus, we obtain a set of 705,915 PubMed Phrases. We evaluate the quality of the set by investigating PubMed user click data and manually annotating a sample of 500 randomly selected noun phrases. We also analyze and discuss the usage of these PubMed Phrases in literature search.

13.

Bridging the gap: Incorporating a semantic similarity measure for effectively mapping PubMed queries to documents.

Kim, Sun; Fiorini, Nicolas; Wilbur, W John; Lu, Zhiyong.

J Biomed Inform ; 75: 122-127, 2017 Nov.

Artigo em Inglês | MEDLINE | ID: mdl-28986328

RESUMO

The main approach of traditional information retrieval (IR) is to examine how many words from a query appear in a document. A drawback of this approach, however, is that it may fail to detect relevant documents where no or only few words from a query are found. The semantic analysis methods such as LSA (latent semantic analysis) and LDA (latent Dirichlet allocation) have been proposed to address the issue, but their performance is not superior compared to common IR approaches. Here we present a query-document similarity measure motivated by the Word Mover's Distance. Unlike other similarity measures, the proposed method relies on neural word embeddings to compute the distance between words. This process helps identify related words when no direct matches are found between a query and a document. Our method is efficient and straightforward to implement. The experimental results on TREC Genomics data show that our approach outperforms the BM25 ranking function by an average of 12% in mean average precision. Furthermore, for a real-world dataset collected from the PubMed® search logs, we combine the semantic measure with BM25 using a learning to rank method, which leads to improved ranking scores by up to 25%. This experiment demonstrates that the proposed approach and BM25 nicely complement each other and together produce superior performance.

Assuntos

Armazenamento e Recuperação da Informação , PubMed , Semântica

14.

The BioC-BioGRID corpus: full text articles annotated for curation of protein-protein and genetic interactions.

Islamaj Dogan, Rezarta; Kim, Sun; Chatr-Aryamontri, Andrew; Chang, Christie S; Oughtred, Rose; Rust, Jennifer; Wilbur, W John; Comeau, Donald C; Dolinski, Kara; Tyers, Mike.

Database (Oxford) ; 20172017.

Artigo em Inglês | MEDLINE | ID: mdl-28077563

RESUMO

A great deal of information on the molecular genetics and biochemistry of model organisms has been reported in the scientific literature. However, this data is typically described in free text form and is not readily amenable to computational analyses. To this end, the BioGRID database systematically curates the biomedical literature for genetic and protein interaction data. This data is provided in a standardized computationally tractable format and includes structured annotation of experimental evidence. BioGRID curation necessarily involves substantial human effort by expert curators who must read each publication to extract the relevant information. Computational text-mining methods offer the potential to augment and accelerate manual curation. To facilitate the development of practical text-mining strategies, a new challenge was organized in BioCreative V for the BioC task, the collaborative Biocurator Assistant Task. This was a non-competitive, cooperative task in which the participants worked together to build BioC-compatible modules into an integrated pipeline to assist BioGRID curators. As an integral part of this task, a test collection of full text articles was developed that contained both biological entity annotations (gene/protein and organism/species) and molecular interaction annotations (protein-protein and genetic interactions (PPIs and GIs)). This collection, which we call the BioC-BioGRID corpus, was annotated by four BioGRID curators over three rounds of annotation and contains 120 full text articles curated in a dataset representing two major model organisms, namely budding yeast and human. The BioC-BioGRID corpus contains annotations for 6409 mentions of genes and their Entrez Gene IDs, 186 mentions of organism names and their NCBI Taxonomy IDs, 1867 mentions of PPIs and 701 annotations of PPI experimental evidence statements, 856 mentions of GIs and 399 annotations of GI evidence statements. The purpose, characteristics and possible future uses of the BioC-BioGRID corpus are detailed in this report.Database URL: http://bioc.sourceforge.net/BioC-BioGRID.html.

Assuntos

Curadoria de Dados/métodos , Mineração de Dados/métodos , Bases de Dados Genéticas , Proteínas/genética , Proteínas/metabolismo

15.

BioCreative V BioC track overview: collaborative biocurator assistant task for BioGRID.

Kim, Sun; Islamaj Dogan, Rezarta; Chatr-Aryamontri, Andrew; Chang, Christie S; Oughtred, Rose; Rust, Jennifer; Batista-Navarro, Riza; Carter, Jacob; Ananiadou, Sophia; Matos, Sérgio; Santos, André; Campos, David; Oliveira, José Luís; Singh, Onkar; Jonnagaddala, Jitendra; Dai, Hong-Jie; Su, Emily Chia-Yu; Chang, Yung-Chun; Su, Yu-Chen; Chu, Chun-Han; Chen, Chien Chin; Hsu, Wen-Lian; Peng, Yifan; Arighi, Cecilia; Wu, Cathy H; Vijay-Shanker, K; Aydin, Ferhat; Hüsünbeyi, Zehra Melce; Özgür, Arzucan; Shin, Soo-Yong; Kwon, Dongseop; Dolinski, Kara; Tyers, Mike; Wilbur, W John; Comeau, Donald C.

Database (Oxford) ; 20162016.

Artigo em Inglês | MEDLINE | ID: mdl-27589962

RESUMO

BioC is a simple XML format for text, annotations and relations, and was developed to achieve interoperability for biomedical text processing. Following the success of BioC in BioCreative IV, the BioCreative V BioC track addressed a collaborative task to build an assistant system for BioGRID curation. In this paper, we describe the framework of the collaborative BioC task and discuss our findings based on the user survey. This track consisted of eight subtasks including gene/protein/organism named entity recognition, protein-protein/genetic interaction passage identification and annotation visualization. Using BioC as their data-sharing and communication medium, nine teams, world-wide, participated and contributed either new methods or improvements of existing tools to address different subtasks of the BioC track. Results from different teams were shared in BioC and made available to other teams as they addressed different subtasks of the track. In the end, all submitted runs were merged using a machine learning classifier to produce an optimized output. The biocurator assistant system was evaluated by four BioGRID curators in terms of practical usability. The curators' feedback was overall positive and highlighted the user-friendly design and the convenient gene/protein curation tool based on text mining.Database URL: http://www.biocreative.org/tasks/biocreative-v/track-1-bioc/.

Assuntos

Curadoria de Dados/métodos , Mineração de Dados/métodos , Processamento Eletrônico de Dados/métodos , Disseminação de Informação/métodos

16.

BioC viewer: a web-based tool for displaying and merging annotations in BioC.

Shin, Soo-Yong; Kim, Sun; Wilbur, W John; Kwon, Dongseop.

Database (Oxford) ; 20162016.

Artigo em Inglês | MEDLINE | ID: mdl-27515823

RESUMO

BioC is an XML-based format designed to provide interoperability for text mining tools and manual curation results. A challenge of BioC as a standard format is to align annotations from multiple systems. Ideally, this should not be a major problem if users follow guidelines given by BioC key files. Nevertheless, the misalignment between text and annotations happens quite often because different systems tend to use different software development environments, e.g. ASCII vs. Unicode. We first implemented the BioC Viewer to assist BioGRID curators as a part of the BioCreative V BioC track (Collaborative Biocurator Assistant Task). For the BioC track, the BioC Viewer helped curate protein-protein interaction and genetic interaction pairs appearing in full-text articles. Here, we describe the BioC Viewer itself as well as improvements made to the BioC Viewer since the BioCreative V Workshop to address the misalignment issue of BioC annotations. While uploading BioC files, a BioC merge process is offered when there are files from the same full-text article. If there is a mismatch between an annotated offset and text, the BioC Viewer adjusts the offset to correctly align with the text. The BioC Viewer has a user-friendly interface, where most operations can be performed within a few mouse clicks. The feedback from BioGRID curators has been positive for the web interface, particularly for its usability and learnability.Database URL: http://viewer.bioqrator.org.

Assuntos

Curadoria de Dados/métodos , Mineração de Dados/métodos , Internet , Interface Usuário-Computador

17.

Meshable: searching PubMed abstracts by utilizing MeSH and MeSH-derived topical terms.

Kim, Sun; Yeganova, Lana; Wilbur, W John.

Bioinformatics ; 32(19): 3044-6, 2016 10 01.

Artigo em Inglês | MEDLINE | ID: mdl-27288493

RESUMO

UNLABELLED: Medical Subject Headings (MeSH(®)) is a controlled vocabulary for indexing and searching biomedical literature. MeSH terms and subheadings are organized in a hierarchical structure and are used to indicate the topics of an article. Biologists can use either MeSH terms as queries or the MeSH interface provided in PubMed(®) for searching PubMed abstracts. However, these are rarely used, and there is no convenient way to link standardized MeSH terms to user queries. Here, we introduce a web interface which allows users to enter queries to find MeSH terms closely related to the queries. Our method relies on co-occurrence of text words and MeSH terms to find keywords that are related to each MeSH term. A query is then matched with the keywords for MeSH terms, and candidate MeSH terms are ranked based on their relatedness to the query. The experimental results show that our method achieves the best performance among several term extraction approaches in terms of topic coherence. Moreover, the interface can be effectively used to find full names of abbreviations and to disambiguate user queries. AVAILABILITY AND IMPLEMENTATION: https://www.ncbi.nlm.nih.gov/IRET/MESHABLE/ CONTACT: sun.kim@nih.gov SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Assuntos

MEDLINE , Medical Subject Headings , PubMed , Ferramenta de Busca , Indexação e Redação de Resumos

18.

Identifying named entities from PubMed for enriching semantic categories.

Kim, Sun; Lu, Zhiyong; Wilbur, W John.

BMC Bioinformatics ; 16: 57, 2015 Feb 21.

Artigo em Inglês | MEDLINE | ID: mdl-25887671

RESUMO

BACKGROUND: Controlled vocabularies such as the Unified Medical Language System (UMLS) and Medical Subject Headings (MeSH) are widely used for biomedical natural language processing (NLP) tasks. However, the standard terminology in such collections suffers from low usage in biomedical literature, e.g. only 13% of UMLS terms appear in MEDLINE. RESULTS: We here propose an efficient and effective method for extracting noun phrases for biomedical semantic categories. The proposed approach utilizes simple linguistic patterns to select candidate noun phrases based on headwords, and a machine learning classifier is used to filter out noisy phrases. For experiments, three NLP rules were tested and manually evaluated by three annotators. Our approaches showed over 93% precision on average for the headwords, "gene", "protein", "disease", "cell" and "cells". CONCLUSIONS: Although biomedical terms in knowledge-rich resources may define semantic categories, variations of the controlled terms in literature are still difficult to identify. The method proposed here is an effort to narrow the gap between controlled vocabularies and the entities used in text. Our extraction method cannot completely eliminate manual evaluation, however a simple and automated solution with high precision performance provides a convenient way for enriching semantic categories by incorporating terms obtained from the literature.

Assuntos

Processamento de Linguagem Natural , PubMed , Semântica , Unified Medical Language System , Vocabulário Controlado , Inteligência Artificial , Humanos , Medical Subject Headings

19.

Extracting drug-drug interactions from literature using a rich feature-based linear kernel approach.

Kim, Sun; Liu, Haibin; Yeganova, Lana; Wilbur, W John.

J Biomed Inform ; 55: 23-30, 2015 Jun.

Artigo em Inglês | MEDLINE | ID: mdl-25796456

RESUMO

Identifying unknown drug interactions is of great benefit in the early detection of adverse drug reactions. Despite existence of several resources for drug-drug interaction (DDI) information, the wealth of such information is buried in a body of unstructured medical text which is growing exponentially. This calls for developing text mining techniques for identifying DDIs. The state-of-the-art DDI extraction methods use Support Vector Machines (SVMs) with non-linear composite kernels to explore diverse contexts in literature. While computationally less expensive, linear kernel-based systems have not achieved a comparable performance in DDI extraction tasks. In this work, we propose an efficient and scalable system using a linear kernel to identify DDI information. The proposed approach consists of two steps: identifying DDIs and assigning one of four different DDI types to the predicted drug pairs. We demonstrate that when equipped with a rich set of lexical and syntactic features, a linear SVM classifier is able to achieve a competitive performance in detecting DDIs. In addition, the one-against-one strategy proves vital for addressing an imbalance issue in DDI type classification. Applied to the DDIExtraction 2013 corpus, our system achieves an F1 score of 0.670, as compared to 0.651 and 0.609 reported by the top two participating teams in the DDIExtraction 2013 challenge, both based on non-linear kernel methods.

Assuntos

Mineração de Dados/métodos , Interações Medicamentosas , Processamento de Linguagem Natural , Publicações Periódicas como Assunto , Máquina de Vetores de Suporte , Vocabulário Controlado , Reconhecimento Automatizado de Padrão/métodos , Semântica

20.

Assisting manual literature curation for protein-protein interactions using BioQRator.

Kwon, Dongseop; Kim, Sun; Shin, Soo-Yong; Chatr-aryamontri, Andrew; Wilbur, W John.

Database (Oxford) ; 20142014.

Artigo em Inglês | MEDLINE | ID: mdl-25052701

RESUMO

The time-consuming nature of manual curation and the rapid growth of biomedical literature severely limit the number of articles that database curators can scrutinize and annotate. Hence, semi-automatic tools can be a valid support to increase annotation throughput. Although a handful of curation assistant tools are already available, to date, little has been done to formally evaluate their benefit to biocuration. Moreover, most curation tools are designed for specific problems. Thus, it is not easy to apply an annotation tool for multiple tasks. BioQRator is a publicly available web-based tool for annotating biomedical literature. It was designed to support general tasks, i.e. any task annotating entities and relationships. In the BioCreative IV edition, BioQRator was tailored for protein- protein interaction (PPI) annotation by migrating information from PIE the search. The results obtained from six curators showed that the precision on the top 10 documents doubled with PIE the search compared with PubMed search results. It was also observed that the annotation time for a full PPI annotation task decreased for a beginner-intermediate level annotator. This finding is encouraging because text-mining techniques were not directly involved in the full annotation task and BioQRator can be easily integrated with any text-mining resources. Database URL: http://www.bioqrator.org/.

Assuntos

Curadoria de Dados/métodos , Mineração de Dados/métodos , Internet , Mapeamento de Interação de Proteínas/métodos , Software , Humanos

RESUMO

Assuntos

RESUMO

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

ENVIAR RESULTADO:

SELEÇÃO DE REFERÊNCIAS

DETALHE DA PESQUISA